Points: 10
The UK government has gathered traffic accident data from 2000 to 2016, capturing over 1.6 million accidents. The data is collected from police reports and does not include minor fender-bender incidents. To reduce complexity and resource requirements, we will work with the accidents that occurred during 2012. The dataset contains a variety of variables related to these accidents.
This dataset is originally from the UK government website but is hosted by BigML Inc. It can be downloaded from the link: https://bigml.com/user/czuriaga/gallery/dataset/525dcd15035d076e7f00e3ac.
From a business perspective, we'd like to establish the association rules that can be derived from the dataset. These association rules can also be referred to as patterns that become apparent through data analysis. Specifically, we are looking at the factors/variables that are related to the severity of an accident. In our analysis, the three different severity levels are slight, serious, and fatal.
For our analysis, we performed association rule learning using the Apriori algorithm. To measure the effectiveness of our association rules analysis, we will examine the relationship between the minimum support, the number of frequent itemsets and rules derived, and the data processing time.
Considering the information presented in this dataset, the stakeholders could vary from everyday drivers to first responders to urban planners. This method of analysis and results will provide a better understanding of the factors surrounding different types of accidents, and allow them to re-route away from accident prone areas, plan and prepare for future incidents, or consider the critical factors when creating new roads or structures.
We wanted to use association rule algorithms in R to derive meaningful correlations between the factors contributing to accidents in the UK. Since the apriori implementation is more stable in R, we decided to do the entire lab using R and Jupyter Notebook in the Anaconda environment.
Points: 10
Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.
We used the same data file as in the previous labs. The accident data is in CSV format (UKAccidents.csv) and has close to 300k observations and 55 attributes. We created another CSV file (UKAccidents_uniquecolumnsR.csv) to rename and skip some of the columns that will not be considered for the association algorithms. This CSV contains two columns, one for the old attribute name and the other for the new attribute name. The attributes that we decided to skip have "Skip" as the new attribute name; we then programmatically remove those columns from the dataframe.
# Accident data file
accidents.file = '../../data/UKAccidents.csv'
accident.data <- read.csv(accidents.file)
# CSV file with the current and new attribute names. The attributes to be removed are named "Skip" and
# they will be removed programmatically.
R.column.names.map.file = '../../data/UKAccidents_uniquecolumnsR.csv'
R.column.names.map <- read.csv(R.column.names.map.file)
head(accident.data)
dim(accident.data)
head(R.column.names.map)
As an example, we can see that location attributes will be removed from the dataframe, as they may not help with the association rules unless converted to categorical regions.
dim(R.column.names.map)
We have an entry for each of the 55 attributes in the mapping file.
Some of the records in the dataset appear to be duplicates, but that was not the case on closer inspection. When two or more vehicles are involved in an accident, there is an instance for each vehicle with the location and other features being the same; however, the driver and vehicle information differs.
Missing values in the dataset are not a problem for association rule mining. When the data is transformed into the sparse transaction format, observations simply omit the items for which values are missing, so no imputation is required.
accident.data.temp <- accident.data
names(accident.data.temp) <- R.column.names.map$New
head(accident.data.temp)
library(data.table)
accident.df = accident.data.temp[,!(names(accident.data.temp) %like% "^Skip")]
names(accident.df)
The following columns are to be recoded:
# cut() bins the numeric attributes into categories and already returns factors
accident.df$EngineCapacity <- cut(accident.df$EngineCapacity, breaks=c(0, 750, 2000, Inf), labels=c("Low","Medium","High"))
accident.df$AgeOfVehicle <- cut(accident.df$AgeOfVehicle, breaks=c(0, 3, 8, Inf), labels=c("New","Medium","Old"))
head(accident.df)
dim(accident.df)
The following plots display the severity of accidents against different factors, shading cells by their Pearson residuals. Residuals measure the difference between the expected and observed number of accidents of each severity, while the Pearson residual is the raw residual divided by the square root of the variance (approximately the expected count).
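As a quick check on what the shading in these plots encodes, the Pearson residual for a contingency-table cell is the raw residual divided by the square root of the expected count, which base R's `chisq.test` returns directly. The counts below are invented purely for illustration:

```r
# Toy severity-by-speed-limit table (invented counts, illustration only)
tab <- matrix(c(40, 10,
                60, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(Severity = c("Fatal", "Slight"),
                              SpeedLimit = c("60", "30")))

xsq <- chisq.test(tab)

# Pearson residual = (observed - expected) / sqrt(expected)
manual <- (tab - xsq$expected) / sqrt(xsq$expected)
all.equal(manual, xsq$residuals)   # the same quantity assoc() shades by
```

A positive residual marks a cell with more accidents than expected under independence, a negative one fewer, matching the blue/red shading in the assoc plots.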
if(! "vcd" %in% installed.packages()) install.packages("vcd", depend = TRUE)
library(vcd)
assoc(~ Severity + SpeedLimit, data=accident.df, shade=TRUE, abbreviate_labs=6)
In the above plot, we see multiple values where more or fewer accidents occurred than expected. For fatal severity, significantly more accidents occurred at a speed limit of 60 than expected, and to a lesser extent at 50 and 70. There is a similar trend for speed limit 60 and serious severity. The opposite is true for slight severity, with fewer accidents than expected at a speed limit of 60, and more than expected at a speed limit of 30.
The overall trend shows that more severe accidents tend to occur at higher speed limits, and less severe accidents tend to occur at lower speed limits.
assoc(~ Severity + Sex, data=accident.df, shade=TRUE, abbreviate_labs=6)
In the above plot, we see multiple values where more or fewer accidents occurred than expected. For accidents with a fatal severity, more accidents occur than expected for males and fewer for females. The same trend is seen for serious accidents. The tables turn for slight accidents, with more accidents than expected for females and fewer for males. The overall trend shows that males tend to be more involved in serious and fatal accidents, while females are more involved in less severe accidents.
assoc(~ Severity + Month, data=accident.df, shade=TRUE, abbreviate_labs=6)
In the above plot, we see a few values where more or fewer accidents occurred than expected. The largest residual is seen in serious accidents during the month of May. There are also more fatal accidents than expected in December. There are fewer serious accidents than expected in January, April, October, and December. The slight accidents do not vary much throughout the year.
There is no clear overall trend; however, the higher number of fatal accidents in December may be due to the holiday rush and/or inclement weather.
Points: 10
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
Support and confidence are generally the primary metrics for evaluating the quality of the association rules generated by the model, and are what we will use as our metrics. Both support and confidence can be used to rank the generated rules, and can therefore determine how useful the predictions are. More importantly, both metrics are used to determine the validity of the rules generated.
The support of a rule indicates how frequently the items in the rule occur together. Support is the ratio of transactions that include all the items in the first and second half of the hypothesis to the number of total transactions.
The confidence of a rule indicates the probability of the second half of the hypothesis appearing in a transaction given that the first half is present. Confidence is the ratio of the rule support to the support of the first half of the hypothesis.
Lift is another metric that can be used to evaluate the quality of a rule. It is defined as the confidence of the combination of items divided by the support of the second half of the hypothesis.
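The three measures are related: for a rule X => Y, confidence(X => Y) = support(X, Y) / support(X), and lift(X => Y) = confidence(X => Y) / support(Y). A minimal sketch with hypothetical transaction counts:

```r
n    <- 1000   # total transactions (hypothetical counts for illustration)
n.x  <- 200    # transactions containing the antecedent X
n.y  <- 300    # transactions containing the consequent Y
n.xy <- 90     # transactions containing both X and Y

support.xy <- n.xy / n                # 0.09
confidence <- support.xy / (n.x / n)  # 0.45 = support(X, Y) / support(X)
lift       <- confidence / (n.y / n)  # 1.5  = confidence / support(Y)
c(support = support.xy, confidence = confidence, lift = lift)
```

Here the lift of 1.5 means X and Y co-occur 1.5 times more often than they would if they were independent.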

We have to prepare the data in the sparse basket format for the apriori algorithm. We followed the steps described below to achieve that:
Read the original CSV file, delete the unwanted columns and rename the columns. We performed this step before the data visualization. We used another manually created CSV file that contained the map of the old and the new column names. We had 265877 observations and 42 attributes.
We created an intermediate transactions CSV file that had each observation transposed into multiple rows, one row per attribute = value pair, keyed by a unique accident ID.
While coming up with the unique list of values, we removed attribute values that indicated missing data such as "N/A" and "Data Missing" and blank values.
We used apply and lapply functions in R to get the list of transactions with the unique accident IDs and we stored them in the transactions.csv file. We had a total of 6,491,814 rows with two columns in the transactions.csv file.
We used read.transactions function in arules library to read the transactions.csv into the basket format required by apriori algorithm.
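The transposition step above can be sketched on a toy data frame (the observations and columns here are made up for illustration), reshaping it into the same two-column AccidentID/Factor layout written to transactions.csv:

```r
# Toy version of the transposition step (invented observations)
df <- data.frame(AccidentID = c(1, 2),
                 Severity   = c("Slight", "Fatal"),
                 SpeedLimit = c(30, 60))

# One "AccidentID, Factor" row per attribute = value pair, as in transactions.csv
long <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  vals <- vapply(df[i, -1], as.character, character(1))
  data.frame(AccidentID = df$AccidentID[i],
             Factor     = paste(names(vals), '=', vals))
}))
long
```

Each of the two observations becomes two rows, so read.transactions can later group the Factor items by AccidentID into one basket per accident.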
# CSV file containing the unique values for each attribute. There will be a number of NA values, which will be removed.
unique.value.file = '../../data/UKAccidents_uniquevalues.csv'
unique.values <- read.csv(unique.value.file)
#Remove NAs from unique.values
unique.values <- na.omit(unique.values)
head(unique.values)
# Function to transpose a single row of accident data into a column of factors in the format <factor name> = <value>. This
# function will be called from apply function call which will iterate through all the rows in the dataset.
getRowValues <- function(p.onerow, p.unique.value) {
  i <- 1
  f <- character()
  factors <- names(p.onerow)
  for (factor in factors) {
    # Keep only values present in the unique-values map; this drops NA,
    # "Data Missing" and blank values from the transaction.
    if (length(p.onerow[[factor]]) > 0 && p.onerow[[factor]] %in% p.unique.value[, factor]) {
      f[i] <- paste(factor, '=', p.onerow[factor])
      i <- i + 1
    }
  }
  return(f)
}
accident.factors <- apply(accident.df, 1, getRowValues, unique.values)
length(accident.factors)
head(accident.factors)
library(dplyr)
options(warn=-1)
accident.transactions <- accident.factors %>% lapply(as.data.frame) %>% bind_rows(.id = "ID") %>% mutate(ID = as.numeric(ID))
dim(accident.transactions)
names(accident.transactions) <- c("AccidentID", "Factor")
# Count the distinct accidents and distinct factor items
accident.transactions %>% summarize(accident.count = n_distinct(AccidentID))
accident.transactions %>% summarize(factor.count = n_distinct(Factor))
head(accident.transactions, 20)
tail(accident.transactions, 20)
# write transactions to CSV file
write.csv(accident.transactions, "../../data/accident_transactions.csv", row.names = FALSE, quote = FALSE)
# Read the transactions CSV file
library(arules)
library(arulesViz)
library(ggplot2)
transactions.obj <- read.transactions(file = "../../data/accident_transactions.csv",
                                      format = "single",
                                      sep = ",",
                                      cols = c("AccidentID", "Factor"),
                                      rm.duplicates = FALSE,
                                      quote = "", skip = 0,
                                      encoding = "unknown")
transactions.obj
In the sparse basket format, there are 265,877 rows and 199 items. The rows correspond to the original number of observations and the columns are factors with their levels.
Association rules analysis does not involve the standard forms of training a dataset that has been performed in the previous labs. However, it does involve specific formatting as detailed above and modeling with the Apriori algorithm. In this section, the Apriori algorithm is implemented and rules are displayed.
Association rule mining can be divided into a two-step approach. First is frequent itemset generation, which involves finding all combinations of items in a set of transactions that occur with a specified minimum frequency. The second step is rule generation, which involves calculating the rules that express the probable co-occurrence of items within frequent itemsets. Using the Apriori algorithm, the probability of an item being present in a frequent itemset is calculated, given that another item or items are present.
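The two steps can be illustrated without the arules machinery on a tiny set of invented transactions (the items and the minimum-support threshold are made up for this sketch):

```r
# Toy transactions: each is a set of items (invented for illustration)
trans <- list(c("A", "B", "C"),
              c("A", "B"),
              c("A", "C"),
              c("B", "C"),
              c("A", "B", "C"))
min.support <- 0.6   # arbitrary threshold for this sketch

# Step 1: frequent itemset generation (pairs only, for brevity)
pairs <- combn(c("A", "B", "C"), 2, simplify = FALSE)
support <- sapply(pairs, function(p)
  mean(sapply(trans, function(t) all(p %in% t))))
names(support) <- sapply(pairs, paste, collapse = ",")
frequent <- support[support >= min.support]
frequent   # all three pairs have support 3/5 = 0.6

# Step 2: rule generation, e.g. confidence of the rule {A} => {B}
supp.A  <- mean(sapply(trans, function(t) "A" %in% t))   # 4/5
conf.AB <- support[["A,B"]] / supp.A                     # 0.6 / 0.8 = 0.75
```

The apriori() calls below perform the same two steps at scale, pruning candidate itemsets that cannot meet the support threshold.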
as.data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = TRUE), 10) ) # Most frequent
as.data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = FALSE), 10)) # Least frequent
itemFrequencyPlot(transactions.obj, topN = 25)
The plot above displays the most frequent items in the dataset. The first bar indicates that accidents usually involve right hand drive, which makes sense for the UK. The second bar indicates accidents occur in places where there is no pedestrian crossing control within 50 meters, and the third that the accidents occur on main crossways. A couple more interesting items are: light condition of daylight, sex of male, urban area, speed limit 30, and vehicle maneuver of going ahead. One bar is important to discuss, and that is severity of slight. The data has significantly more instances of slight accidents than serious and fatal, but we attempt to generate rules for all with the values used as inputs/parameters.
# Interest Measures
support <- 0.01
# Frequent item sets
parameters = list(
  support = support,
  minlen = 2,   # Minimal number of items per item set
  maxlen = 10,  # Maximal number of items per item set
  target = "frequent itemsets"
)
freq.items <- apriori(transactions.obj, parameter = parameters)
# Let us examine our frequent itemsets
freq.items.df <- data.frame(item_set = labels(freq.items), support = freq.items@quality)
head(freq.items.df, 5)
tail(freq.items.df, 5)
# Let us now examine the rules
confidence <- 0.2 # Interest Measure
parameters = list(
  support = support,
  confidence = confidence,
  minlen = 4,   # Minimal number of items per rule
  maxlen = 10,  # Maximal number of items per rule
  target = "rules"
)
rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
,rules@quality)
head(rules.df)
tail(rules.df)
rules
The rules generated above utilized a support value of 0.01 and a confidence value of 0.2. These rules are more general; the next section focuses on rules for each severity level. The lift value denotes the performance of the model at predicting or classifying, measured against random chance. It can more simply be broken down as the ratio of the target response to the average response. A lift value less than 1 denotes a negative association, while a value greater than 1 denotes a positive association and a likely more useful rule.
In this case, from the head we can see one rule that sticks out with a lift greater than 1. That rule is: {Severity = Fatal} => {DidPoliceOfficerAttend = Yes}. This would make sense, that a police officer would attend the scene of a fatal car accident. No rule truly stands out in the tail, as a lift value of 1 generally implies the probability of occurrence of the antecedent and consequent are independent.
rules.sorted <- sort(rules, by='lift')
inspect(rules.sorted[1:10], n=10)
A simple plot of the first 1,000 rules is shown in the following plot. More detailed plots will be discussed in the following sections.
plot(rules.sorted[1:1000], method = NULL, measure = "support", shading = "lift", engine='plotly')

In this section we wanted to compare the association rules for the various accident severities (Fatal, Serious and Slight). The dataset is unbalanced in the number of observations for each severity, as shown in the bar plot below. Therefore, we had to tune the support and confidence parameters of the apriori algorithm individually for each severity to get relevant rules.
# counts
g <- ggplot(accident.df, aes(Severity))
g + geom_bar()
serious.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.00047, conf=0.3),
appearance = list(rhs=c('Severity = Serious'), default='lhs'), control = list(verbose=F))
serious.rules
serious.rules.sorted <- sort(serious.rules, by='lift')
inspect(serious.rules.sorted[1:10], n=10)
The above are rules generated using the Apriori algorithm for accidents with a serious severity. Seeing as the rules are sorted by lift, we are seeing rules with a lift significantly greater than 1, indicating positive associations and more useful rules. Most of the visible rules involve a low engine capacity, the vehicle leaving the carriageway offside, and some other factor contributing to a severity of serious. It is also important to note the count on the right side, showing how often each rule appears in the dataset, with all the visible rules appearing over 150 times. These rules were generated with a support of 0.00047 and confidence of 0.3.
fatal.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.00001, conf=0.4),
appearance = list(rhs=c('Severity = Fatal'), default='lhs'), control = list(verbose=F))
fatal.rules
fatal.rules.sorted <- sort(fatal.rules, by='lift')
inspect(fatal.rules.sorted[1:10], n=10)
The above are rules generated using the Apriori algorithm for accidents with a fatal severity. Seeing as the rules are sorted by lift, we are seeing rules with the largest lift values. Most of the visible rules involve carriageway hazards from a previous accident, junction detail of more than 4 arms, some a speed limit of 70, and fog or mist weather conditions. Based on these factors, the rules do seem likely to involve fatal accidents. One would assume that bad weather plays a role in driving conditions, and that driving fast results in a more severe incident. It is worth noting that the hazards show previous accidents had occurred at these locations. These rules were generated with a low support of 0.00001 and confidence of 0.4.
slight.rules <- apriori(transactions.obj, parameter = list(minlen=3, supp=0.6, conf=0.8),
appearance = list(rhs=c('Severity = Slight'), default='lhs'), control = list(verbose=F))
slight.rules
slight.rules.sorted <- sort(slight.rules, by='lift')
inspect(slight.rules.sorted[1:10], n=10)
The above are rules generated using the Apriori algorithm for accidents with a slight severity. Seeing as the rules are sorted by lift, we are seeing rules with a lift close to 1, which may or may not indicate useful associations. Most of the visible rules involve a location on a main road, right hand drive, daylight conditions, and some other factor contributing to a severity of slight. It is also important to note the count on the right side, showing how often each rule appears in the dataset, with all the visible rules appearing over 166,000 times. These rules were generated with a support of 0.6 and confidence of 0.8. If the support value were lower, there would be numerous rules, since slight accidents make up the majority of the dataset.
There were 46 rules generated for the accidents with a slight severity. The rules are plotted above demonstrating the evaluation metrics of support and lift. The darker colors demonstrate a higher lift, and those rules tend to have lower support.
# Scatter plot of rules
plot(serious.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

# Interactive scatter plots
plot(serious.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='plotly')

plot(head(sort(serious.rules.sorted, by="lift"), 200), method = "grouped")
plot(serious.rules[1:15], method = "paracoord")
The visualization of the rules for serious accidents shows that the light conditions, speed limit and deprivation play a major role.
# Scatter plot of rules
plot(fatal.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

# Interactive scatter plots
plot(fatal.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='plotly')

plot(head(sort(fatal.rules.sorted, by="lift"), 100), method = "grouped")
plot(fatal.rules[1:10], method = "paracoord")
The visualization of the rules for fatal accidents shows that the light conditions, speed limit and deprivation play a major role. It also shows that males are more prone to be involved in fatal accidents than females.
# Scatter plot of rules
plot(slight.rules.sorted, method = "scatterplot", measure = c("support","lift"), shading = "order", jitter=0, engine="plotly")

# Interactive scatter plots
plot(slight.rules.sorted, method = NULL, measure = "support", shading = "lift", engine='plotly')

plot(head(sort(slight.rules.sorted, by="lift"), 100), method = "grouped")
plot(slight.rules[1:10], method = "paracoord")
The visualization of the rules for slight accidents shows that all the factors can come into play in the cause of the accidents.
plot(head(sort(serious.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))
Based on earlier analysis, we saw the trend that a lower value of support and larger value of confidence resulted in the highest lift values. This graph demonstrates those rules with a deeper opaque color. These rules maintain the analysis, as the smaller circles represent lower support values. Some useful rules involve fine and daylight environmental conditions, vehicle leaving the carriageway offside, vehicle location on a main roadway, and a low engine capacity.
plot(head(sort(fatal.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))
This plot is unique in that all the rules shown seem to share the same support and lift values. We believe this is due to the very small parameter values: the support used in rule generation for fatal accidents was 0.00001, which appears to have been rounded to 0 for the purposes of the graph. In this case, we defer to the results seen above in the rule generation. Because there are so few data points for fatal accidents, a small support value was needed to generate rules.
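As a sanity check on this parameter choice, a support threshold of 0.00001 over the 265,877 transactions in the basket corresponds to a very small absolute count:

```r
n.transactions <- 265877     # transactions in the basket (from the summary above)
min.support    <- 0.00001    # support threshold used for the fatal rules
# Minimum absolute number of transactions a rule must cover to survive
ceiling(min.support * n.transactions)
```

Any rule covering at least three fatal-accident transactions can clear the threshold, which explains why the plotted fatal rules cluster at nearly identical support values.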
plot(head(sort(slight.rules.sorted, by="lift"), 25), method="graph", control=list(cex=.7))
This plot also supports the above assertion that a lower support value and larger confidence value results in higher lift values, demonstrated by the small opaque circles. Some of the recurring factors in the generated rules for slight accidents are: vehicle type car, right hand drive, no pedestrian crossing within 50 meters, and vehicle location on a main roadway. By just looking at these rules, some seem to be a bit implied, such as the vehicle being a car and having right hand drive in the UK. This is in part due to the large number of slight accidents in the dataset, which is the majority. For the slight analysis, a higher support value of 0.6 is used to reduce the number of rules generated.
Points: 10
Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
The goal of this analysis was to establish the association rules that can be derived from the UK accident data from 2012. Specifically, we wanted to examine the factors in the rules that are related to the severity of the numerous accidents. In this respect, our goal was achieved in generating the various rules associated with the slight, serious, and fatal severities, along with the statistics needed to determine their usefulness.
The rules generated are absolutely usable by various stakeholders, especially those rules that are deemed more useful based on the evaluation metrics. Normal drivers that are concerned with safety of routes can use the information to decide what types of junctions or intersections to avoid on commutes. First responders and health professionals can get an idea of what factors lead to different severities of accidents, and better prepare in those areas that could be considered high risk based on the association rules. Planners and developers can utilize the information from this analysis to better understand the specific factors surrounding different types of accidents and work to avoid them as best possible. The association rules not only provide information into the severity of accidents, but the factors surrounding them; there are countless parties that could benefit from and utilize the findings.
This model would be best deployed as an online resource or in a consulting capacity. As an accessible resource, this would allow a variety of parties to utilize the information to meet their needs. Industries or companies looking to fully understand accident severity in the UK could hire us as consultants to analyze it in more depth and create visualizations based on specific scenarios. The association rules could also be altered to meet the need, moving beyond severity and diving into the subsets of different factors.
To further improve the models, more data is always helpful in drawing conclusions on the usefulness of the generated association rules. More data regarding serious and fatal accidents would allow for the same confidence as in the slight accident results. Additional data on the type of accident, such as fender bender or head-on collision, would allow for more granularity in the analysis of the association rules.
For reliable results, the model should be updated every quarter to ensure the most recent data is factored into the association rules. This up-to-date information provides more assurance in the results and conclusions drawn.
The observations for the accidents have the latitude and longitude coordinates of the place where the accident occurred. We used Google Maps to plot the accidents for the entire United Kingdom, and also for some specific cities, to see how accidents are spread out across the various areas.
In order to use Google Maps in our code, we had to register a Google API key with the ggmap package:
options(warn=-1)
library(ggmap)
library(ggplot2)
register_google(key = "<YOUR_GOOGLE_API_KEY>")  # key redacted; supply your own
UKMap <- get_map(location = c(lon = -3, lat = 54), color = "color", source = "google", maptype = "terrain",
zoom = 6)
UKggMap <- ggmap(UKMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
UKggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity),
data = accident.data)
The country-wide map shows slight accidents spread throughout the entire country, serious accidents in some concentrated places, and fatal accidents near the cities.
LondonMap <- get_map(location = c(lon = -0.16179, lat = 51.538525), color = "color", source = "google", maptype = "terrain",
zoom = 12)
LondonggMap <- ggmap(LondonMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
LondonggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity),
data = accident.data)
The map of London shows slight accidents on most of the main roads. They seem to occur closer to the city center or downtown.
ManchesterMap <- get_map(location = c(lon = -2.244644, lat = 53.4808), color = "color", source = "google", maptype = "terrain",
zoom = 12)
ManchesterggMap <- ggmap(ManchesterMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
ManchesterggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity),
data = accident.data)
The Manchester map shows accidents concentrated in the city center.
BirminghamMap <- get_map(location = c(lon = -1.898575, lat = 52.48947), color = "color", source = "google", maptype = "terrain",
zoom = 12)
BirminghamggMap <- ggmap(BirminghamMap, extent = "device", ylab = "Latitude", xlab = "Longitude")
BirminghamggMap + geom_point(aes(x = Longitude, y = Latitude, colour= Accident.Severity, fill = Accident.Severity),
data = accident.data)
The Birmingham map shows accidents evenly spread across all parts of the city.
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
MSDS 7331 - Lab Three: Clustering, Association Rules, or Recommenders, Investigators: Matt Baldree, Ben Brock, Tom Elkins, Austin Kelly